Part 1 - Exploratory Data Analysis

Total number of samples: 10,000

We begin the study by analyzing each of the features individually.

Then we can start looking at features cross-sectionally.

NOTE: If this were to be placed in production, we'd have to take into account the fact that some features can be discriminatory against a client. Features such as sex and age, for example, shouldn't be taken into account when predicting defaults. However, for this exercise we will proceed disregarding this note.

Data Description

| Data | Definition / Value(s) |
|---|---|
| ID | ID of each client |
| LIMIT_BAL | Amount of given credit in NT dollars (includes individual and family/supplementary credit) |
| SEX | Gender (1=male, 2=female) |
| EDUCATION | Education (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown) |
| MARRIAGE | Marital status (1=married, 2=single, 3=others) |
| PAY_0 | Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above) |
| PAY_2 | Repayment status in August, 2005 (scale same as above) |
| PAY_3 | Repayment status in July, 2005 (scale same as above) |
| PAY_4 | Repayment status in June, 2005 (scale same as above) |
| PAY_5 | Repayment status in May, 2005 (scale same as above) |
| PAY_6 | Repayment status in April, 2005 (scale same as above) |
| AGE | Age in years |
| BILL_AMT1 | Amount of bill statement in September, 2005 (NT dollar) |
| BILL_AMT2 | Amount of bill statement in August, 2005 (NT dollar) |
| BILL_AMT3 | Amount of bill statement in July, 2005 (NT dollar) |
| BILL_AMT4 | Amount of bill statement in June, 2005 (NT dollar) |
| BILL_AMT5 | Amount of bill statement in May, 2005 (NT dollar) |
| BILL_AMT6 | Amount of bill statement in April, 2005 (NT dollar) |
| PAY_AMT1 | Amount of previous payment in September, 2005 (NT dollar) |
| PAY_AMT2 | Amount of previous payment in August, 2005 (NT dollar) |
| PAY_AMT3 | Amount of previous payment in July, 2005 (NT dollar) |
| PAY_AMT4 | Amount of previous payment in June, 2005 (NT dollar) |
| PAY_AMT5 | Amount of previous payment in May, 2005 (NT dollar) |
| PAY_AMT6 | Amount of previous payment in April, 2005 (NT dollar) |
| default.payment.next.month | Default payment (1=yes, 0=no) |

A brief look lets us categorize the features as follows.

CATEGORICAL:

CONTINUOUS:

Feature Correlations

Let's see how correlated the variables are with each other and with the target, default.

Immediate observations are:

Duplicates

It is sensible for us to see if there are any duplicates in the data.

It is difficult to tell whether these duplicates are accidental, since they come from cards that were seldom used. Therefore, we will not remove the duplicated rows.
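A minimal sketch of the duplicate check on a toy frame (the column values here are made up; on the real data we would exclude ID first, since IDs are unique):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the credit dataset (ID excluded)
df = pd.DataFrame({
    "LIMIT_BAL": [50000, 50000, 120000],
    "BILL_AMT1": [0, 0, 3913],
    "PAY_AMT1":  [0, 0, 689],
})

# Rows that are exact duplicates of an earlier row
n_dupes = df.duplicated().sum()
# keep=False flags every member of a duplicated group for inspection
dupe_rows = df[df.duplicated(keep=False)]
```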

Null Values

Only PAY_1 has nulls.

For the rows where PAY_1 is missing, 100% of the targets are non-default, a good indicator of no default. We therefore fill the PAY_1 NaNs with 0, the only value that appears in the column.
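A sketch of the imputation, assuming we fill with 0 to match the only value observed in the column (the toy values below are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"PAY_1": [0.0, np.nan, 0.0, np.nan]})

# Fill missing PAY_1 with the constant observed elsewhere in the column,
# then restore an integer dtype
df["PAY_1"] = df["PAY_1"].fillna(0).astype(int)
```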

Bills Paid = 0 and PAY_AMT = 0

If an individual has BILL_AMT = 0 and PAY_AMT = 0 for all time periods, the individual either didn't use the card or paid the balance immediately (which effectively means they didn't use credit). This corresponds to a PAY_ status of -2.

So, what does it mean when an individual defaults but didn't use the card? Likely faulty data...?
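One way to flag these suspect rows is to check that every bill and payment column is zero while the target is 1; a sketch on two made-up rows:

```python
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

df = pd.DataFrame(
    [[0] * 12 + [1],           # never used the card, yet marked as default
     [100] * 12 + [0]],        # active card, no default
    columns=bill_cols + pay_cols + ["default"],
)

# Accounts with zero bills and zero payments in every period
inactive = df[bill_cols].eq(0).all(axis=1) & df[pay_cols].eq(0).all(axis=1)
# Inactive yet defaulted: the possibly-faulty rows discussed above
suspect = df[inactive & (df["default"] == 1)]
```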

Negative Valued Bill_AMTs or PAY_AMTs

These may be from people that either get a refund OR pay their entire balance ahead of time. We explore the former.
Of these users, how many default?

As expected, users who never have a negative balance default at a higher rate than users who sometimes do. Paying extra or receiving refunds is therefore a useful signal for predicting defaults.

How many users pass their limit balance X times?

It is quite significant.
Of these users, what are the odds they default if they pass their limit balance X times?

As expected, non-defaulting accounts are more likely to stay within their limit balance than defaulting ones. Specifically, 88.9% of non-defaulted accounts never exceed their limit balance, versus only 81% of defaulted accounts.
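A sketch of the over-limit split, on a hypothetical three-account frame; the idea is to test each monthly bill against the limit and average within each target class:

```python
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]

df = pd.DataFrame({
    "LIMIT_BAL": [50000, 50000, 20000],
    **{c: [0, 60000, 10000] for c in bill_cols},
    "default": [0, 1, 0],
})

# Did any monthly bill exceed the credit limit?
over = df[bill_cols].gt(df["LIMIT_BAL"], axis=0).any(axis=1)

# Share of accounts staying within the limit, split by default status
within_by_class = (~over).groupby(df["default"]).mean()
```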

Helper Functions

Age

* Immediately obvious: all 100 18-year-olds paid their dues.
* Distributions are slightly skewed: individuals who default are younger than those who pay.
* There seem to be spikes in the Paid distribution, so we will inspect further.
* These spikes fade away when we use more granular bins, so they can be safely ignored.

Categorical Vars

Education

Education has the values 4, 5, 6, and 0 all effectively meaning "other": the description maps 4 to "others" and 5 and 6 to "unknown", while 0 is undocumented. We relabel them all as 4, as per the description. We could leave them separate, as the distinction might be useful to a model, but because we strive for model explainability, we collapse them into a single "other" category.

Marriage

Marriage has a value of 0, which is not in the description. For the sake of explainability, we'll also label it as "other" (3).
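The relabeling for both columns can be sketched as a couple of `replace` calls (toy values below; the mapping follows the data description):

```python
import pandas as pd

df = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4, 5, 6, 0],
    "MARRIAGE": [1, 2, 3, 0, 1, 2, 3],
})

# Collapse undocumented / unknown education codes into 4 ("others")
df["EDUCATION"] = df["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
# Collapse the undocumented marriage code 0 into 3 ("others")
df["MARRIAGE"] = df["MARRIAGE"].replace({0: 3})
```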

Pay_1

PAY_

It's unclear why only PAY_5 and PAY_6 have values of 1. It's also unclear what a value of 0 means, since -1 means paid duly and 1 means a one-month payment delay. We assume a value of 0 means the balance was paid when requested.

Pay_AMTs

Bill Amounts

Notice the log distributions of BILL_AMTs and PAY_AMTs

Limit Balance

Multiple Feature Explorations

Are large payments made by the same people month on month? Also, are there any outliers or messy data?

The many points aligned along the x or y axis indicate large payments made in a single month. The dots in the center of the graph are consumers who make large payments consistently. Additionally, the 99th percentile of spenders seldom defaults.

The thick line at roughly 20 degrees represents those who pay what they owe, mostly non-defaults. A second, steeper line represents those with large bills that are either not paid in a lump sum or paid late.

Additionally, it seems like we don't have bad data, as most outliers are balanced between BILL and PAY.

A few observations arise from these charts:

Some feature engineering before predictions

For a larger dataset we could use automatic feature engineering tools such as AutoFeat or FeatureTools, but because each individual has so little data, these tools don't quite fit our needs; we can do the aggregations manually.

We now generate features and see whether the features/transformations generated improve predictions. We assess the latter by training a simple LGBMClassifier and assessing feature importances.
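A sketch of the kind of manual aggregations meant here, on a one-row toy frame; the feature names (`BILL_MEAN`, `PAY_TOTAL`, `PAY_BILL_RATIO`, and so on) are illustrative, not the exact set used later:

```python
import pandas as pd

bill_cols = [f"BILL_AMT{i}" for i in range(1, 7)]
pay_cols = [f"PAY_AMT{i}" for i in range(1, 7)]

df = pd.DataFrame(
    [[100, 200, 300, 400, 500, 600, 10, 20, 30, 40, 50, 60]],
    columns=bill_cols + pay_cols,
)

# Aggregate features over the six monthly columns
df["BILL_MEAN"] = df[bill_cols].mean(axis=1)
df["BILL_STD"] = df[bill_cols].std(axis=1)
df["PAY_TOTAL"] = df[pay_cols].sum(axis=1)
# Ratio of what was paid to what was billed (guarding against zero bills)
df["PAY_BILL_RATIO"] = df["PAY_TOTAL"] / df[bill_cols].sum(axis=1).replace(0, 1)
```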

One thing to note is that the feature importances reported by LGBMClassifier are based on the number of splits a feature generates, which isn't always indicative of the absolute best feature.

Scoring

We measure the accuracy of our predictions by optimizing for a recall-heavy F1 score.

Accuracy measures how many observations, both positive and negative, were correctly classified. But we care most about missed defaults (false negatives): cases where we say an account won't default but it does. Since defaults are the minority class, a simple accuracy score won't do.

Enter the F1 score - a combination of precision and recall. Since precision tells you how many positive identifications were correct and recall tells you what proportion of actual positives was identified correctly, we need to strike a balance between the two.

$$ F_1 = \frac{tp}{tp + \frac{1}{2} (fp + fn)} = \frac{2 \cdot precision \times recall}{precision + recall} $$

When setting the F-beta parameter, we will lean towards recall rather than precision, as classifying a non-defaulting card as a default is better than the other way around. This means choosing a beta greater than 1.
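A minimal sketch of how beta shifts the score towards recall, computed directly from confusion counts (the counts below are made up):

```python
def fbeta(tp, fp, fn, beta):
    """F-beta from confusion counts; beta > 1 weights recall more heavily."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Here recall (0.8) exceeds precision (2/3), so a recall-leaning beta = 2
# rewards this classifier more than the plain F1 does
f1 = fbeta(tp=80, fp=40, fn=20, beta=1)
f2 = fbeta(tp=80, fp=40, fn=20, beta=2)
```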

Alongside the F score, we track the ROC AUC (roc_auc_score), which summarizes the trade-off between true and false positive rates across all threshold settings.

Feature Engineering

For sake of avoiding data bias, we don't use SEX, MARRIAGE, or EDUCATION to make further features, as these could be discriminating against customers. Therefore, we only use bills, payments, and payment delays.

Converting Features to Normal Distributions

Linear models tend to predict better when the features are approximately standard normal. We can get closer to this by transforming some of our features.

The transformation we apply is the Box-Cox transformation, which helps convert a skewed distribution towards normal. Box-Cox uses a parameter, lambda, and requires strictly positive inputs, so zero values must be shifted before transforming. The best way to choose lambda is to correlate the transformed distribution with a normal probability plot and pick the lambda that gives the highest correlation.

The features we'll transform are the BILL_AMTs and the PAY_AMTs. In order to not overfit, we'll select the same lambda for all columns.

When scaling the BILL_AMTs, we substitute 0s for the negative (pre-paid) balances. This will make scaling of the features easier.
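A self-contained sketch of the transform on synthetic skewed amounts (the lognormal stand-in data and the +1 shift for zeros are assumptions for illustration); with lambda = 0, Box-Cox reduces to a log and should largely remove the skew:

```python
import numpy as np

def boxcox(x, lam):
    # Box-Cox transform; requires strictly positive x
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1) / lam

def skew(v):
    # Standardized third moment, a quick normality check
    v = (v - v.mean()) / v.std()
    return (v ** 3).mean()

# Skewed positive amounts standing in for a PAY_AMT column, zeros shifted by +1
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=8, sigma=1.5, size=5000) + 1

raw_skew = skew(amounts)
log_skew = skew(boxcox(amounts, 0.0))  # lambda = 0 is the log transform
```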

One-hot encoding of categorical features
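A sketch of the encoding with `pd.get_dummies` on toy values; `drop_first` is one reasonable choice here to avoid perfectly collinear dummy columns:

```python
import pandas as pd

df = pd.DataFrame({"EDUCATION": [1, 2, 3, 4], "MARRIAGE": [1, 2, 3, 1]})

# One dummy column per category level, dropping the first level of each
encoded = pd.get_dummies(df, columns=["EDUCATION", "MARRIAGE"], drop_first=True)
```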

Dim Reduction

A good technique for extracting signal from many features is dimensionality reduction: each new dimension captures a large share of the variance in the data.
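A minimal PCA sketch via SVD on centered stand-in data (random here; in the notebook it would be the scaled bill/pay columns), keeping three components as in the plots below:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))          # stand-in for the scaled feature matrix
X = X - X.mean(axis=0)                 # PCA requires centered data

# PCA via SVD: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
components = X @ Vt[:3].T              # project onto the top 3 components

explained = (S ** 2) / (S ** 2).sum()  # variance share per component
```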

We add these new dimensions to our main dataframe and see if our accuracy improves.

There is some separation, mostly between PCA_2 and both PCA_1 and PCA_3, but nothing outstanding.

Try UMAP

Another dimensionality reduction technique is called UMAP.

Not as good as expected, but UMAP_2 vs. UMAP_3 does show some nice separation of colors...

Try LDA

Evidently, LDA seems like a solid predictor!

How correlated are these features?

The only correlated pair of variables is lda with pca_2. A correlation threshold of 0.8 is commonly used, so we apply that and don't drop any of the features above.

Evidently, the accuracy of our model doesn't improve with the new features, and the ROC AUC even diminishes. Let's see if it improves if we only take the best features.

Doesn't work. What if we only take lda as it is the feature most correlated to the target?

There is no improvement. All the information captured by the PCA/LDA features is already exploited by the gradient boosting tree. Therefore, we only pass on the best dataset: the one with the manually engineered features, which has the highest accuracy and ROC AUC score.

Ready for Production

We'll pass all of these features on to the model development phase to see whether better models improve accuracy and are capable of exploiting these features.